Search | VHL Regional Portal

1.

DPI_CDF: druggable protein identifier using cascade deep forest.

Arif, Muhammad; Fang, Ge; Ghulam, Ali; Musleh, Saleh; Alam, Tanvir.

BMC Bioinformatics ; 25(1): 145, 2024 Apr 05.

Article in English | MEDLINE | ID: mdl-38580921

ABSTRACT

BACKGROUND: Drug targets in living organisms play a pivotal role in the discovery of potential drugs. Conventional wet-lab characterization of drug targets, although accurate, is generally expensive, slow, and resource intensive. Computational methods are therefore highly desirable as an alternative to expedite the large-scale identification of druggable proteins (DPs); however, the performance of existing in silico predictors is still not satisfactory. METHODS: In this study, we developed a novel deep learning-based model, DPI_CDF, for predicting DPs from protein sequence only. DPI_CDF utilizes evolutionary-based (i.e., histograms of oriented gradients for position-specific scoring matrix), physicochemical-based (i.e., component protein sequence representation), and compositional-based (i.e., normalized qualitative characteristic) properties of the protein sequence to generate features. A hierarchical deep forest model then fuses these three encoding schemes to build the proposed model. RESULTS: The empirical outcomes of 10-fold cross-validation demonstrate that the proposed model achieved 99.13% accuracy and a Matthews correlation coefficient (MCC) of 0.982 on the training dataset. The generalization power of the trained model was further examined on an independent dataset, where it achieved a maximum accuracy of 95.01% and an MCC of 0.900. Compared with current state-of-the-art methods, DPI_CDF improves accuracy by 4.27% and 4.31% on the training and testing datasets, respectively. We believe DPI_CDF will help the research community identify druggable proteins and accelerate the drug discovery process. AVAILABILITY: The benchmark datasets and source code are available on GitHub: http://github.com/Muhammad-Arif-NUST/DPI_CDF .
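
For readers unfamiliar with the two metrics reported in this abstract, the following is a minimal sketch of how accuracy and the Matthews correlation coefficient are computed from binary confusion-matrix counts. The function names are illustrative assumptions and the code is not taken from the DPI_CDF repository.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, the MCC stays near zero for a classifier that ignores the minority class, which is why both are usually reported together on imbalanced protein datasets.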

Subjects

Proteins, Software, Amino Acid Sequence, Position-Specific Scoring Matrices, Biological Evolution, Computational Biology/methods

2.

Modeling Peptide-Protein Interactions by a Logo-Based Method: Application in Peptide-HLA Binding Predictions.

Doytchinova, Irini; Atanasova, Mariyana; Fernandez, Antonio; Moreno, F Javier; Koning, Frits; Dimitrov, Ivan.

Molecules ; 29(2)2024 Jan 05.

Article in English | MEDLINE | ID: mdl-38257197

ABSTRACT

Peptide-protein interactions form a cornerstone in molecular biology, governing cellular signaling, structure, and enzymatic activities in living organisms. Improving computational models and experimental techniques to describe and predict these interactions remains an ongoing area of research. Here, we present a computational method for peptide-protein interactions' description and prediction based on leveraged amino acid frequencies within specific binding cores. Utilizing normalized frequencies, we construct quantitative matrices (QMs), termed 'logo models' derived from sequence logos. The method was developed to predict peptide binding to HLA-DQ2.5 and HLA-DQ8.1 proteins associated with susceptibility to celiac disease. The models were validated by more than 17,000 peptides demonstrating their efficacy in discriminating between binding and non-binding peptides. The logo method could be applied to diverse peptide-protein interactions, offering a versatile tool for predictive analysis in molecular binding studies.
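
The idea of a quantitative matrix built from position-wise amino acid frequencies in aligned binding cores can be sketched as follows. This is an illustrative assumption, not the authors' method: it uses log-odds scores against a uniform background with a pseudocount, and the names `build_qm` and `score_peptide` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_qm(binding_cores, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per binding-core position.

    Frequencies are normalized against a uniform background, which is an
    assumption made for this sketch.
    """
    length = len(binding_cores[0])
    if any(len(core) != length for core in binding_cores):
        raise ValueError("all binding cores must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(binding_cores) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(length):
        counts = Counter(core[pos] for core in binding_cores)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score_peptide(matrix, peptide):
    """Best score of the matrix over all windows of the peptide."""
    width = len(matrix)
    if len(peptide) < width:
        raise ValueError("peptide is shorter than the binding core")
    return max(
        sum(matrix[i][peptide[start + i]] for i in range(width))
        for start in range(len(peptide) - width + 1)
    )
```

A peptide would then be called a binder when its best window score exceeds a threshold chosen on validation data; the threshold itself is outside the scope of this sketch.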

Subjects

Celiac Disease, Peptides, Humans, Amino Acids, Molecular Biology, Position-Specific Scoring Matrices

3.

plotnineSeqSuite: a Python package for visualizing sequence data using ggplot2 style.

Cao, Tianze; Li, Qian; Huang, Yuexia; Li, Anshui.

BMC Genomics ; 24(1): 585, 2023 Oct 03.

Article in English | MEDLINE | ID: mdl-37789265

ABSTRACT

BACKGROUND: The sequence logo has been an active area in the development of bioinformatics tools. ggseqlogo, written in R, has been the most popular API since it was published. With the rise of artificial intelligence and deep learning, Python is currently the most popular programming language, and bioinformaticians have begun to shift to it. Providing Python APIs similar to those in R can reduce the cost of learning a new programming language. Compared with ggplot2 in R, drawing frameworks in Python are not as easy to use. The appearance of plotnine (a Python implementation of ggplot2) makes it possible to unify the programming style of bioinformatics visualization tools between R and Python. RESULTS: Here, we introduce plotnineSeqSuite, a new plotnine-based Python package that provides a ggseqlogo-like API for programmatic drawing of sequence logos, sequence alignment diagrams and sequence histograms. It supports custom letters, color themes, and fonts. Moreover, the class for drawing layers follows an object-oriented design so that users can easily encapsulate and extend it. CONCLUSIONS: plotnineSeqSuite is the first ggplot2-style package to implement visualization of sequence-related graphs in Python. It enhances the uniformity of programmatic plotting between R and Python. Compared with existing tools, plotnineSeqSuite supports a more complete set of plot categories. The source code of plotnineSeqSuite can be obtained on GitHub ( https://github.com/caotianze/plotnineseqsuite ) and PyPI ( https://pypi.org/project/plotnineseqsuite ), and the documentation homepage is freely available on GitHub at ( https://caotianze.github.io/plotnineseqsuite/ ).

Subjects

Artificial Intelligence, Software, Programming Languages, Computational Biology, Position-Specific Scoring Matrices

4.

Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices.

Liu, Daniel; Steinegger, Martin.

Bioinformatics ; 39(8)2023 08 01.

Article in English | MEDLINE | ID: mdl-37535681

ABSTRACT

MOTIVATION: Efficiently aligning sequences is a fundamental problem in bioinformatics. Many recent algorithms for computing alignments through Smith-Waterman-Gotoh dynamic programming (DP) exploit Single Instruction Multiple Data (SIMD) operations on modern CPUs for speed. However, these advances have largely ignored the difficulties of efficiently handling complex scoring matrices or large gaps (insertions or deletions). RESULTS: We propose a new SIMD-accelerated algorithm called Block Aligner for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. We introduce a new paradigm that uses blocks in the DP matrix that greedily shift, grow, and shrink. This approach allows regions of the DP matrix to be adaptively computed. Our algorithm is 5-10 times faster than some previous methods while incurring an error rate of less than 3% on protein and long-read datasets, despite large gaps and low sequence identities. AVAILABILITY AND IMPLEMENTATION: Our algorithm is implemented for global, local, and X-drop alignments. It is available as a Rust library (with C bindings) at https://github.com/Daniel-Liu-c0deb0t/block-aligner.
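
As background for the abstract above, the following is a plain, unaccelerated sketch of the Smith-Waterman-Gotoh recurrence for local alignment with affine gaps. It computes the full DP matrix row by row; it does not implement Block Aligner's adaptive blocks or any SIMD, and the scoring parameters and function name are assumptions chosen for illustration.

```python
def gotoh_local_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # H[i-1][*]: best score ending at each cell
    e_prev = [neg_inf] * cols  # E[i-1][*]: best score ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        e_cur = [neg_inf] * cols
        f = neg_inf            # F[i][j-1]: best score ending in a horizontal gap
        for j in range(1, cols):
            # Open a new gap from H or extend an existing one.
            e_cur[j] = max(h_prev[j] + gap_open, e_prev[j] + gap_extend)
            f = max(h_cur[j - 1] + gap_open, f + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets a local alignment restart at any cell.
            h_cur[j] = max(0, h_prev[j - 1] + sub, e_cur[j], f)
            best = max(best, h_cur[j])
        h_prev, e_prev = h_cur, e_cur
    return best
```

Every cell of the matrix is visited here, which is exactly the cost that adaptive approaches aim to avoid by computing only the regions likely to contain the optimal path.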

Subjects

Algorithms, Proteins, Position-Specific Scoring Matrices, Sequence Alignment, Sequence Analysis, Software

5.

A convolutional network and attention mechanism-based approach to predict protein-RNA binding residues.

Li, Ke; Wu, Hongwei; Yue, Zhenyu; Sun, Yu; Xia, Chuan.

Comput Biol Chem ; 105: 107901, 2023 Aug.

Article in English | MEDLINE | ID: mdl-37327559

ABSTRACT

Protein-RNA interactions play a key role in various cellular processes, and many experimental and computational studies have been initiated to analyze these interactions. However, experimental determination is quite complex and expensive. Therefore, researchers have worked to develop efficient computational tools to detect protein-RNA binding residues. The accuracy of existing methods is limited by the features of the target and the performance of the computational models; there remains room for improvement. To solve the problem of the accurate detection of protein-RNA binding residues, we propose a convolutional network model named PBRPre based on an improved MobileNet. First, by extracting the position information of the target complex and the 3-mer amino acid feature data, the position-specific scoring matrix (PSSM) is improved by using spatial neighbor smoothing and the discrete wavelet transform to fully exploit the spatial structure information of the target and enrich the feature dataset. Second, the deep learning model MobileNet is used to integrate and optimize the potential features in the target complexes; then, by introducing the Vision Transformer (ViT) network classification layer, the deep-level information of the target is mined to enhance the model's ability to process global information and to improve the detection accuracy of the classifiers. The results show that the AUC value of the model reaches 0.866 on the independent testing dataset, which shows that PBRPre can effectively detect protein-RNA binding residues. All datasets and source code of PBRPre are available at https://github.com/linglewu/PBRPre for academic use.

Subjects

Amino Acids, RNA, RNA/chemistry, Protein Binding, Amino Acids/metabolism, Position-Specific Scoring Matrices

6.

Integrating reduced amino acid composition into PSSM for improving copper ion-binding protein prediction.

Liu, Shanghua; Liang, Yuchao; Li, Jinzhao; Yang, Siqi; Liu, Ming; Liu, Chengfang; Yang, Dezhi; Zuo, Yongchun.

Int J Biol Macromol ; 244: 124993, 2023 Jul 31.

Article in English | MEDLINE | ID: mdl-37307968

ABSTRACT

Copper ion-binding proteins play an essential role in metabolic processes and are critical factors in many diseases, such as breast cancer, lung cancer, and Menkes disease. Many algorithms have been developed for predicting metal ion classification and binding sites, but none have been applied to copper ion-binding proteins. In this study, we developed a copper ion-binding protein classifier, RPCIBP, which integrates the reduced amino acid composition into the position-specific scoring matrix (PSSM). The reduced amino acid composition filters out a large number of uninformative evolutionary features, improving the operational efficiency and predictive ability of the model (feature dimension reduced from 2900 to 200, ACC increased from 83% to 85.1%). Compared with the basic model using only three sequence feature extraction methods (ACC between 73.8% and 86.2% on the training set and between 69.3% and 87.5% on the test set), the model integrating the evolutionary features of the reduced amino acid composition showed higher accuracy and robustness (ACC between 83.1% and 90.8% on the training set and between 79.1% and 91.9% on the test set). The best copper ion-binding protein classifiers identified by the feature selection process were deployed in a user-friendly web server (http://bioinfor.imu.edu.cn/RPCIBP). RPCIBP can accurately predict copper ion-binding proteins, which is convenient for further structural and functional studies and conducive to mechanism exploration and target drug development.

Subjects

Copper, Proteins, Position-Specific Scoring Matrices, Proteins/chemistry, Algorithms, Amino Acids/chemistry, Protein Databases, Computational Biology/methods

7.

Deep-AGP: Prediction of angiogenic protein by integrating two-dimensional convolutional neural network with discrete cosine transform.

Ali, Farman; Alghamdi, Wajdi; Almagrabi, Alaa Omran; Alghushairy, Omar; Banjar, Ameen; Khalid, Majdi.

Int J Biol Macromol ; 243: 125296, 2023 Jul 15.

Article in English | MEDLINE | ID: mdl-37301349

ABSTRACT

Angiogenic proteins (AGPs) play a primary role in the formation of new blood vessels from pre-existing ones. AGPs have diverse applications in cancer, including serving as biomarkers, guiding anti-angiogenic therapies, and aiding in tumor imaging. Understanding the role of AGPs in cardiovascular and neurodegenerative diseases is vital for developing new diagnostic tools and therapeutic approaches. Considering the significance of AGPs, in this research we established, for the first time, a computational model using deep learning for identifying AGPs. First, we constructed a sequence-based dataset. Second, we explored features by designing a novel feature encoder, called position-specific scoring matrix-decomposition-discrete cosine transform (PSSM-DC-DCT), and by using existing descriptors including Dipeptide Deviation from Expected Mean (DDE) and bigram-position-specific scoring matrix (Bi-PSSM). Third, each feature set was fed into a two-dimensional convolutional neural network (2D-CNN) and machine learning classifiers. Finally, the performance of each learning model was validated by 10-fold cross-validation (CV). The experimental results demonstrate that the 2D-CNN with the proposed novel feature descriptor achieved the highest success rate on both training and testing datasets. In addition to being an accurate predictor for the identification of angiogenic proteins, our proposed method (Deep-AGP) might be fruitful in understanding cancer, cardiovascular, and neurodegenerative diseases, in developing novel therapeutic methods for them, and in drug design.
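
Of the descriptors named in this abstract, the bigram PSSM is simple enough to sketch. The version below assumes the commonly used definition B[m][n] = sum over i of P[i][m] * P[i+1][n], where P is the (usually normalized) PSSM with one row per residue. Whether Deep-AGP uses exactly this form or a particular normalization is not stated in the abstract, and the function names are hypothetical.

```python
def bigram_pssm(pssm):
    """Return a square matrix B with B[m][n] = sum_i pssm[i][m] * pssm[i+1][n].

    `pssm` is a list of rows, one per sequence position, each holding one
    score per amino acid column (20 for a standard PSSM).
    """
    if len(pssm) < 2:
        raise ValueError("need at least two sequence positions")
    width = len(pssm[0])
    bigram = [[0.0] * width for _ in range(width)]
    # Pair every row with its successor along the sequence.
    for current, following in zip(pssm, pssm[1:]):
        for m in range(width):
            for n in range(width):
                bigram[m][n] += current[m] * following[n]
    return bigram


def flatten(matrix):
    """Row-major flattening, giving a fixed-length feature vector."""
    return [value for row in matrix for value in row]
```

For a standard PSSM the result is a 400-dimensional vector regardless of protein length, which is what makes it usable as input to fixed-size classifiers.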

Subjects

Machine Learning, Computer Neural Networks, Position-Specific Scoring Matrices

8.

Stack-VTP: prediction of vesicle transport proteins based on stacked ensemble classifier and evolutionary information.

Chen, Yu; Gao, Lixin; Zhang, Tianjiao.

BMC Bioinformatics ; 24(1): 137, 2023 Apr 07.

Article in English | MEDLINE | ID: mdl-37029385

ABSTRACT

Vesicle transport proteins not only play an important role in the transmembrane transport of molecules but also have a place in the field of biomedicine, so their identification is particularly important. We propose a method based on ensemble learning and evolutionary information to identify vesicle transport proteins. First, we preprocess the imbalanced dataset by random undersampling. Second, we extract the position-specific scoring matrix (PSSM) from protein sequences, further extract AADP-PSSM and RPSSM features from the PSSM, and use the Max-Relevance-Max-Distance (MRMD) algorithm to select the optimal feature subset. Finally, the optimal feature subset is fed into the stacked classifier for vesicle transport protein identification. The experimental results show that the accuracy (ACC), sensitivity (SN) and specificity (SP) of our method on the independent testing set are 82.53%, 0.774 and 0.836, respectively. The SN, SP and ACC of our proposed method are 0.013, 0.007 and 0.76% higher than those of the current state-of-the-art methods.
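
The first preprocessing step mentioned above, random undersampling, can be sketched in a few lines. This is a generic illustration under the assumption that every class is reduced to the size of the smallest one; the function name and the fixed seed are choices made for the example, not details from Stack-VTP.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class has as many as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(group) for group in by_class.values())
    kept_samples, kept_labels = [], []
    for label, group in by_class.items():
        # rng.sample draws without replacement.
        for sample in rng.sample(group, target):
            kept_samples.append(sample)
            kept_labels.append(label)
    return kept_samples, kept_labels
```

Undersampling should be applied to the training split only, so that the independent test set keeps its original class ratio.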

Subjects

Algorithms, Position-Specific Scoring Matrices, Vesicular Transport Proteins, Amino Acid Sequence, Carrier Proteins, Support Vector Machine, Vesicular Transport Proteins/genetics, Vesicular Transport Proteins/isolation & purification

9.

dipwmsearch: a Python package for searching di-PWM motifs.

Mille, Marie; Ripoll, Julie; Cazaux, Bastien; Rivals, Eric.

Bioinformatics ; 39(4)2023 04 03.

Article in English | MEDLINE | ID: mdl-37010504

ABSTRACT

MOTIVATION: Searching for probabilistic motifs in a sequence is a common task when annotating putative transcription factor binding sites or other RNA/DNA binding sites. Useful motif representations include position weight matrices (PWMs), dinucleotide PWMs (di-PWMs), and hidden Markov models (HMMs). Dinucleotide PWMs combine the simplicity of PWMs (a matrix form and a cumulative scoring function) with the ability to incorporate dependency between adjacent positions in the motif (unlike PWMs, which disregard any dependency). For instance, the HOCOMOCO database provides di-PWM motifs derived from experimental data to represent binding sites. Currently, two programs, SPRy-SARUS and MOODS, can search for occurrences of di-PWMs in sequences. RESULTS: We propose a Python package called dipwmsearch, which provides an original and efficient algorithm for this task (it first enumerates matching words for the di-PWM, and then searches for all of them at once in the sequence, even if the latter contains IUPAC codes). The user benefits from easy installation via PyPI or conda, comprehensive documentation, and executable scripts that facilitate the use of di-PWMs. AVAILABILITY AND IMPLEMENTATION: dipwmsearch is available at https://pypi.org/project/dipwmsearch/ and https://gite.lirmm.fr/rivals/dipwmsearch/ under the CeCILL license.

Subjects

Algorithms, Computational Biology, Binding Sites, Protein Binding, Position-Specific Scoring Matrices

10.

ResidualBind: Uncovering Sequence-Structure Preferences of RNA-Binding Proteins with Deep Neural Networks.

Koo, Peter K; Ploenzke, Matt; Anand, Praveen; Paul, Steffan; Majdandzic, Antonio.

Methods Mol Biol ; 2586: 197-215, 2023.

Article in English | MEDLINE | ID: mdl-36705906

ABSTRACT

Deep neural networks have demonstrated improved performance at predicting sequence specificities of DNA- and RNA-binding proteins. However, it remains unclear why they perform better than previous methods that rely on k-mers and position weight matrices. Here, we highlight a recent deep learning-based software package, called ResidualBind, that analyzes RNA-protein interactions using only RNA sequence as an input feature and performs global importance analysis for model interpretability. We discuss practical considerations for model interpretability to uncover learned sequence motifs and their secondary structure preferences.

Subjects

Computer Neural Networks, RNA, RNA/genetics, RNA-Binding Proteins/metabolism, DNA/metabolism, Position-Specific Scoring Matrices, Protein Binding

11.

Empirical Study of Protein Feature Representation on Deep Belief Networks Trained With Small Data for Secondary Structure Prediction.

Rashid, Shamima; Sundaram, Suresh; Kwoh, Chee Keong.

IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 955-966, 2023.

Article in English | MEDLINE | ID: mdl-35439138

ABSTRACT

Protein secondary structure (SS) prediction is a classic problem of computational biology and is widely used in structural characterization and to infer homology. While most SS predictors have been trained on thousands of sequences, a previous approach developed a compact model of training proteins that used an energy-based feature representation derived from the C-Alpha, C-Beta Side Chain (CABS) algorithm. Here, the previous approach is extended to Deep Belief Networks (DBN). Deep learning methods are notorious for requiring large datasets, and there is a wide consensus that training deep models from scratch on small datasets works poorly. By contrast, we demonstrate a simple DBN architecture containing a single hidden layer, trained only on the CB513 dataset. Testing on an independent set of G Switch proteins improved the Q3 score of the previous compact model by almost 3%. The findings are further confirmed by comparison to several deep learning models that are trained on thousands of proteins. Finally, the DBN performance is also compared with a Position Specific Scoring Matrix (PSSM)-profile based feature representation. The importance of (i) structural information in protein feature representation and (ii) complementary small dataset learning approaches for detection of structural fold switching is demonstrated.
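
The Q3 score referred to above is the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. A minimal sketch, with an illustrative function name:

```python
def q3_score(predicted, observed):
    """Percentage of residues with a correctly predicted three-state label.

    Both arguments are equal-length strings over the alphabet H, E, C.
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For example, `q3_score("HHHEECCC", "HHHEEECC")` compares eight residues, seven of which agree.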

Subjects

Algorithms, Computational Biology, Consensus, Position-Specific Scoring Matrices, Protein Domains

12.

Predicting TF Proteins by Incorporating Evolution Information Through PSSM.

Du, Zhihua; Huang, Tianyou; Uversky, Vladimir N; Li, Jianqiang.

IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 1319-1326, 2023.

Article in English | MEDLINE | ID: mdl-35981062

ABSTRACT

Transcription factors (TFs) are DNA-binding proteins involved in the regulation of gene expression. They exist in all organisms and activate or repress transcription by binding to specific DNA sequences. Traditionally, TFs have been identified by experimental methods that are time-consuming and costly. In recent years, various computational methods have been developed to identify TFs and overcome these limitations. However, there is room for further improvement in the predictive accuracy of these tools. We report here a novel computational tool, TFnet, that provides accurate and comprehensive TF predictions from protein sequences. The accuracy of these predictions is substantially better than that of existing TF predictors and methods. In particular, it significantly outperforms comparable methods when sequence similarity to other known sequences in the database drops below 40%. Ablation tests reveal that the high predictive performance stems from the innovative ways in which TFnet derives the sequence Position-Specific Scoring Matrix (PSSM) and encodes inputs.

Subjects

DNA-Binding Proteins, Transcription Factors, Position-Specific Scoring Matrices, Transcription Factors/metabolism, DNA-Binding Proteins/metabolism

13.

Identification of adaptor proteins by incorporating deep learning and PSSM profiles.

Gao, Wentao; Xu, Dali; Li, Hongfei; Du, Junping; Wang, Guohua; Li, Dan.

Methods ; 209: 10-17, 2023 01.

Article in English | MEDLINE | ID: mdl-36427763

ABSTRACT

Adaptor proteins, also known as signal transduction adaptor proteins, are important proteins in signal transduction pathways and play a role in connecting signal proteins for signal transduction between cells. Studies have shown that adaptor proteins are closely related to some diseases, such as tumors and diabetes. Therefore, it is very meaningful to construct a model that accurately identifies adaptor proteins. In recent years, many studies have used a position-specific scoring matrix (PSSM) and neural network methods to identify adaptor proteins. However, ordinary neural network models cannot correlate the contextual information in PSSM profiles well, so these studies usually compress the 20×N (N > 20) PSSM into 20×20 dimensions, which results in the loss of a large amount of protein information. This research proposes an efficient method that combines a one-dimensional convolutional neural network (1-D CNN) and a bidirectional long short-term memory network (biLSTM) to identify adaptor proteins. The complete PSSM profiles are the input of the model, and the complete information of the protein is retained during the training process. We perform cross-validation during model training and test the performance of the model on an independent test set. On a data set with 1224 adaptor proteins and 11,078 non-adaptor proteins, five indicators, namely specificity, sensitivity, accuracy, area under the receiver operating characteristic curve (AUC) and Matthews correlation coefficient (MCC), were employed to evaluate model performance. On the independent test set, the specificity, sensitivity, accuracy and MCC were 0.817, 0.865, 0.823 and 0.465, respectively. These results show that our method is better than the state-of-the-art methods. This study aims to improve the accuracy of adaptor protein identification and lays a foundation for further research on diseases related to adaptor proteins. It also provides a new idea for the application of deep learning models in bioinformatics and computational biology.
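
The 20×N to 20×20 reduction that this abstract criticizes is often done by summing, for each amino acid type, the PSSM rows of all positions where that residue occurs. The sketch below shows this one common variant as an assumption; the abstract does not say which reduction earlier studies used. Here the PSSM is stored as N rows of 20 scores (the transpose of the 20×N layout in the text), and `compress_pssm` is a hypothetical name.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def compress_pssm(sequence, pssm):
    """Collapse an N x 20 PSSM into a 20 x 20 matrix.

    Row a of the result is the sum of the PSSM rows at every position whose
    residue is amino acid a. Position order along the sequence is lost.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    compressed = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        if residue not in index:
            continue  # skip non-standard residues such as X
        target = compressed[index[residue]]
        for j, value in enumerate(row):
            target[j] += value
    return compressed
```

Because the sum discards where each residue sits in the chain, two proteins with shuffled but otherwise identical profiles map to the same matrix, which illustrates the information loss that motivates feeding the complete profile to a sequence model.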

Subjects

Deep Learning, Position-Specific Scoring Matrices, Computer Neural Networks, Software, Signal Transducing Adaptor Proteins, Algorithms

14.

Optimized Data Set and Feature Construction for Substrate Prediction of Membrane Transporters.

Denger, Andreas; Helms, Volkhard.

J Chem Inf Model ; 62(23): 6242-6257, 2022 Dec 12.

Article in English | MEDLINE | ID: mdl-36454173

ABSTRACT

α-Helical transmembrane proteins termed membrane transporters mediate the passage of small hydrophilic substrate molecules across biological lipid bilayer membranes. Annotating the specific substrates of the dozens to hundreds of individual transporters of an organism is an important task. In the past, machine learning classifiers have been successfully trained on pan-organism data sets to predict putative substrates of transporters. Here, we critically examine the selection of an optimal data set of protein sequence features for the classification task. We focus on membrane transporters of the three model organisms Escherichia coli, Arabidopsis thaliana, and Saccharomyces cerevisiae, as well as human. We show that organism-specific classifiers can be robustly trained if at least 20 samples are available for each substrate class. If information from position-specific scoring matrices is included, such classifiers have F1 scores between 0.85 and 1.00. For the largest data set (A. thaliana), a 4-class classifier yielded an F-score of 0.97. On a pan-organism data set composed of transporters of all four organisms, amino acid and sugar transporters were predicted with an F1 score of 0.91.

Subjects

Arabidopsis, Membrane Transport Proteins, Humans, Membrane Transport Proteins/metabolism, Arabidopsis/chemistry, Saccharomyces cerevisiae/metabolism, Position-Specific Scoring Matrices, Machine Learning

15.

Robust and accurate prediction of self-interacting proteins from protein sequence information by exploiting weighted sparse representation based classifier.

Li, Yang; Hu, Xue-Gang; You, Zhu-Hong; Li, Li-Ping; Li, Pei-Pei; Wang, Yan-Bin; Huang, Yu-An.

BMC Bioinformatics ; 23(Suppl 7): 518, 2022 Dec 01.

Article in English | MEDLINE | ID: mdl-36457083

ABSTRACT

BACKGROUND: Self-interacting proteins (SIPs), in which two or more copies of a protein expressed by one gene can interact with each other, play a central role in the regulation of most living cells and cellular functions. Although high-throughput experimental techniques can provide abundant SIP data, the experimental identification of SIPs still has several shortcomings: it is time-consuming, costly, inefficient, and inherently prone to high false-positive rates. It is therefore increasingly important to develop efficient and accurate automatic approaches that supplement experimental methods and accelerate the prediction of SIPs from protein sequence information. RESULTS: In this paper, we present a novel framework, termed GLCM-WSRC (gray level co-occurrence matrix-weighted sparse representation based classification), for predicting SIPs automatically based on evolutionary information from protein primary sequences. More specifically, we first convert the protein sequence into a Position Specific Scoring Matrix (PSSM) containing protein sequence evolutionary information, using the Position Specific Iterated BLAST (PSI-BLAST) tool. Second, using an efficient feature extraction approach, i.e., GLCM, we extract salient and invariant feature vectors from the PSSM, and then apply a pre-processing operation, the adaptive synthetic (ADASYN) technique, to balance the SIPs dataset and generate new feature vectors for classification. Finally, we employ an efficient and reliable WSRC model to identify SIPs according to the known information of self-interacting and non-interacting proteins. CONCLUSIONS: Extensive experimental results show that the proposed approach exhibits high prediction performance, with 98.10% accuracy on the yeast dataset and 91.51% accuracy on the human dataset, which suggests that the proposed model could be a useful tool for large-scale self-interacting protein prediction and other bioinformatics detection tasks in the future.
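
The gray level co-occurrence matrix step can be illustrated by treating a PSSM as a small image: scores are quantized to a few gray levels and pairs of neighbouring cells are counted. The sketch below is a generic GLCM with one texture statistic (contrast); the number of levels, the pixel offset, and the function names are assumptions rather than settings from GLCM-WSRC.

```python
def quantize(matrix, levels):
    """Map numeric scores linearly onto integer gray levels 0..levels-1."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    span = (hi - lo) or 1
    return [[min(levels - 1, int((v - lo) * levels / span)) for v in row]
            for row in matrix]


def glcm(image, levels, dr=0, dc=1):
    """Count how often level i has level j at offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


def contrast(counts):
    """Contrast statistic: mean squared level difference over counted pairs."""
    total = sum(map(sum, counts))
    if total == 0:
        return 0.0
    weighted = sum((i - j) ** 2 * n
                   for i, row in enumerate(counts)
                   for j, n in enumerate(row))
    return weighted / total
```

Statistics such as contrast, computed for several offsets, yield a fixed-length feature vector from a PSSM of any length.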

Subjects

Biological Evolution, Computational Biology, Humans, Amino Acid Sequence, Position-Specific Scoring Matrices, Leukocytes, Saccharomyces cerevisiae/genetics

16.

Prediction of antifreeze proteins using machine learning.

Khan, Adnan; Uddin, Jamal; Ali, Farman; Ahmad, Ashfaq; Alghushairy, Omar; Banjar, Ameen; Daud, Ali.

Sci Rep ; 12(1): 20672, 2022 11 30.

Article in English | MEDLINE | ID: mdl-36450775

ABSTRACT

Living organisms, including fishes, microbes, and animals, can live in extremely cold weather. To stay alive in cold environments, these species generate antifreeze proteins (AFPs), also referred to as ice-binding proteins. AFPs are also extensively utilized in many important fields, including medicine, agriculture, industry, and biotechnology. Several predictors have been constructed to identify AFPs. However, due to the sequence and structural heterogeneity of AFPs, correct identification is still a challenging task, and a more accurate predictor is highly desirable. In this research, a novel computational method, named AFP-LXGB, is proposed for predicting AFPs more precisely. Features are extracted by Dipeptide Composition (DPC), Grouped Amino Acid Composition (GAAC), Position Specific Scoring Matrix-Segmentation-Autocorrelation Transformation (Sg-PSSM-ACT), and Pseudo Position Specific Scoring Matrix Tri-Slicing (PseTS-PSSM). To retain the benefits of ensemble learning, these feature sets are concatenated into different combinations. The best feature set is selected by Extremely Randomized Tree-Recursive Feature Elimination (ERT-RFE). The models are trained by Light eXtreme Gradient Boosting (LXGB), Random Forest (RF), and Extremely Randomized Tree (ERT). Among the classifiers, LXGB obtained the best prediction results. The novel method (AFP-LXGB) improved accuracy by 3.70% and 4.09% over the best existing methods. These results verify that AFP-LXGB can predict AFPs more accurately and can play a significant role in medical, agricultural, industrial, and biotechnological fields.
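
Dipeptide Composition, the simplest of the encodings listed above, is the normalized frequency of each of the 400 ordered amino acid pairs in a sequence. A minimal sketch follows; the function name is illustrative, and pairs containing non-standard residues are ignored as an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional dipeptide frequency vector of a sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skips pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dipeptide")
    return [counts[d] / total for d in DIPEPTIDES]
```

The vector has the same length for every protein, so it can be concatenated directly with the other feature sets before feature selection.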

Subjects

Antifreeze Proteins, alpha-Fetoproteins, Animals, Machine Learning, Position-Specific Scoring Matrices, Agriculture

17.

ggmotif: An R Package for the extraction and visualization of motifs from MEME software.

Li, Xiang; Ma, Linna; Mei, Xinyue; Liu, Yixiang; Huang, Huichuan.

PLoS One ; 17(11): e0276979, 2022.

Article in English | MEDLINE | ID: mdl-36327240

ABSTRACT

MEME (Multiple EM for Motif Elicitation) is the most commonly used tool to identify motifs within deoxyribonucleic acid (DNA) or protein sequences. However, the results generated by MEME are saved in the .xml and .txt file formats, which are difficult to read, visualize, or integrate with other widely used phylogenetic tree packages, such as ggtree. To overcome this problem, we developed the ggmotif R package, which provides two easy-to-use functions that facilitate the extraction and visualization of motifs from the result files generated by MEME. ggmotif can extract the location of each motif on the corresponding sequence from the .xml file and visualize it. Additionally, the data extracted by ggmotif can be easily integrated with phylogenetic data. Furthermore, ggmotif can obtain the sequence of each motif from the .txt file and draw the sequence logo with the function ggseqlogo from the ggseqlogo R package. The ggmotif R package is freely available (including examples and vignettes) from GitHub at https://github.com/lixiang117423/ggmotif or from CRAN at https://CRAN.R-project.org/package=ggmotif.

Subjects

Software, Phylogeny, Position-Specific Scoring Matrices, Amino Acid Sequence

18.

Identifying SNARE Proteins Using an Alignment-Free Method Based on Multiscan Convolutional Neural Network and PSSM Profiles.

Kha, Quang-Hien; Ho, Quang-Thai; Le, Nguyen Quoc Khanh.

J Chem Inf Model ; 62(19): 4820-4826, 2022 10 10.

Article in English | MEDLINE | ID: mdl-36166351

ABSTRACT

Background: SNARE proteins play a vital role in membrane fusion, cellular physiology and pathological processes. Many potential therapeutics for mental diseases or even cancer based on SNAREs are also being developed. Therefore, there is a pressing need to predict SNAREs for further manipulation of these essential proteins, which demands new and efficient approaches. Methods: Some computational frameworks have been proposed to overcome the hurdles of biological methods, which require considerable time and budget to identify SNAREs. However, the performance of existing frameworks has been unsatisfactory, as they fail to retain the SNARE sequence order and to capture the many hidden features of SNAREs. This paper proposes a novel model built on a multiscan convolutional neural network (CNN) and position-specific scoring matrix (PSSM) profiles to address these limitations. We trained our model on the benchmark dataset with fivefold cross-validation and evaluated it on two different independent datasets. Results: Overall, the multiscan CNN was cross-validated on the training set and excelled in SNARE classification, reaching 0.963 in AUC and 0.955 in AUPRC. In addition, with a sensitivity, specificity, accuracy, and MCC of 0.842, 0.968, 0.955, and 0.767, respectively, our proposed framework outperformed previous models in the SNARE recognition task. Conclusions: We believe that our model can contribute to discriminating SNARE proteins from general proteins.

Subjects

Neural Networks, Computer , SNARE Proteins , Position-Specific Scoring Matrices

19.

Top-Down Crawl: a method for the ultra-rapid and motif-free alignment of sequences with associated binding metrics.

Cooper, Brendon H; Chiu, Tsu-Pei; Rohs, Remo.

Bioinformatics ; 38(22): 5121-5123, 2022 Nov 15.

Article in English | MEDLINE | ID: mdl-36179084

ABSTRACT

SUMMARY: Several high-throughput protein-DNA binding methods currently available produce highly reproducible measurements of binding affinity at the level of the k-mer. However, understanding where a k-mer is positioned along a binding site sequence depends on alignment. Here, we present Top-Down Crawl (TDC), an ultra-rapid tool designed for the alignment of k-mer level data in a rank-dependent and position weight matrix (PWM)-independent manner. As the framework only depends on the rank of the input, the method can accept input from many types of experiments (protein binding microarray, SELEX-seq, SMiLE-seq, etc.) without the need for specialized parameterization. Measuring the performance of the alignment using multiple linear regression with 5-fold cross-validation, we find TDC to perform as well as or better than computationally expensive PWM-based methods. AVAILABILITY AND IMPLEMENTATION: TDC can be run online at https://topdowncrawl.usc.edu or locally as a Python package available through pip at https://pypi.org/project/TopDownCrawl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Software , Position-Specific Scoring Matrices , Binding Sites , Sequence Analysis, DNA/methods , Protein Binding

20.

A deep learning-based method for the prediction of DNA interacting residues in a protein.

Patiyal, Sumeet; Dhall, Anjali; Raghava, Gajendra P S.

Brief Bioinform ; 23(5), 2022 Sep 20.

Article in English | MEDLINE | ID: mdl-35943134

ABSTRACT

DNA-protein interaction is one of the most crucial interactions in biological systems and decides the fate of many processes, such as transcription, regulation and splicing of genes. In this study, we trained our models on a training dataset of 646 DNA-binding proteins having 15 636 DNA interacting and 298 503 non-interacting residues. Our trained models were evaluated on an independent dataset of 46 DNA-binding proteins having 965 DNA interacting and 9911 non-interacting residues. All proteins in the independent dataset have less than 30% sequence similarity with proteins in the training dataset. A wide range of models based on traditional machine learning and deep learning (1D-CNN) techniques were developed using binary profiles, physicochemical properties and Position-Specific Scoring Matrix (PSSM)/evolutionary profiles. Among the machine learning techniques, the eXtreme Gradient Boosting-based model achieved a maximum area under the receiver operating characteristic (AUROC) curve of 0.77 on the independent dataset using the PSSM profile. The deep learning-based model achieved the highest AUROC of 0.79 on the independent dataset using a combination of all three profiles. We evaluated the performance of existing methods on the independent dataset and observed that our proposed method outperformed all of them. To serve the scientific community, we developed standalone software and a web server, which are accessible from https://webs.iiitd.edu.in/raghava/dbpred.
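As a reading aid for the AUROC values quoted in this abstract, the sketch below computes AUROC in its rank-based form: the probability that a randomly chosen positive (interacting) residue receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are hypothetical, and this is not the authors' evaluation code.

```python
def auroc(labels, scores):
    """Rank-based AUROC: fraction of (positive, negative) pairs ranked correctly.

    labels are 1 (positive) or 0 (negative); ties in score count as 0.5.
    The pairwise loop is O(P * N), which is fine for an illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy per-residue labels and predicted scores, for illustration only.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2]
print(round(auroc(toy_labels, toy_scores), 3))
```

AUROC is threshold-free and does not depend on the class ratio, which is one reason it is commonly reported for residue-level tasks where non-interacting residues heavily outnumber interacting ones, as in the datasets described above.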

Subjects

Deep Learning , DNA/chemistry , DNA/genetics , DNA-Binding Proteins , Databases, Protein , Position-Specific Scoring Matrices
